ABSTRACT 

Cys2-His2 zinc finger proteins (ZFPs) are the largest 
family of transcription factors in higher metazoans. 
They also represent the most diverse family with 
regard to the composition of their recognition 
sequences. Although there are a number of ZFPs 
with characterized DNA-binding preferences, the 
specificity of the vast majority of ZFPs is unknown 
and cannot be directly inferred by homology due to 
the diversity of recognition residues present within 
individual fingers. Given the large number of unique 
zinc fingers and assemblies present across eukary- 
otes, a comprehensive predictive recognition model 
that could accurately estimate the DNA-binding 
specificity of any ZFP based on its amino acid 
sequence would have great utility. Toward this 
goal, we have used the DNA-binding specificities 
of 678 two-finger modules from both natural and 
artificial sources to construct a random forest- 
based predictive model for ZFP recognition. We 
find that our recognition model outperforms previ- 
ously described determinant-based recognition 
models for ZFPs, and can successfully estimate 
the specificity of naturally occurring ZFPs with 
previously defined specificities. 



INTRODUCTION 

Defining the grammar underlying the transcriptional regu- 
latory elements within the human genome remains a 
critical step in understanding both developmental and 
disease processes (1). The advent of high-throughput 
sequencing technology has fueled the development of 
methodologies for the genome-wide characterization of 
regulatory features, such as global histone modifications 
(1-10). These data coupled with global analysis of RNA 
transcript levels (6,11), chromatin immunoprecipitation 
(ChIP)-based occupancy data for sequence-specific 
transcription factors (TFs) (7,12-14) and chromatin con- 
formational capture techniques (15) provide a framework 
for deconvoluting regulatory networks directing gene ex- 
pression patterns (16,17). Currently, only a small subset of 
human TFs has been characterized by ChIP-based 
approaches in any given cell line (7,13,14), although 
some sequence occupancy can be inferred from DNase I 
(12,17) and MNase (18) data. In the absence of genome- 
wide binding data, knowledge of the DNA-binding 
specificities of the TFs within regulatory networks in 
concert with data sets on sequence conservation, chromatin 
accessibility and histone modifications can be exploited 
by computational algorithms to predict TF genomic occu- 
pancy, and thereby construct more elaborate 
transcriptional regulatory models (1,9,17,19-24). Given 
the difficulty in characterizing the diverse binding 
patterns of all expressed TFs in all possible temporal and 
spatial expression patterns in vertebrates, the ability to 
estimate the specificity of the constellation of TFs ex- 
pressed at any given time in a given cell type provides a 
critical data set for constructing these regulatory models. 

Cys2-His2 zinc finger proteins (ZFPs) are the largest 
class of TFs within most metazoans (25), with an 
estimated 675 members in the human genome (26) harbor- 
ing an average of 8.5 finger units per gene (27). The 
majority of these ZFPs are believed to be involved in 
DNA-recognition, as many of the neighboring fingers 
are connected by a Krüppel-type TGE(K/R)P linker, 
which is a hallmark of DNA-binding fingers (28). The 
canonical DNA-recognition model for an individual 
finger is based on the ZFP-DNA co-crystal structure of 
Zif268 (29,30) and other naturally occurring and engin- 
eered ZFPs (31-35), wherein each finger potentially recog- 
nizes a 4-bp subsite that overlaps the recognition site of 
the neighboring N- and C-terminal fingers by 1 bp 
(Figure 1A). Amino acid residues at positions -1, +2, 
+3 and +6 of the recognition helix typically mediate the 
recognition preference of a finger within its subsite. The 
target site preference of a tandem array of fingers reflects a 
complex interaction between the individual finger 
modules, as the recognition properties of an individual 
finger can be influenced by its position within an array 
and the recognition determinants displayed by its imme- 
diate neighbors (36-41). 
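
The canonical layout described above can be sketched in a few lines of Python. This is only an illustration: it assumes the common convention that a 7-residue recognition helix string lists positions -1 through +6 in order, and the function names are hypothetical rather than part of any published tool.

```python
# Offsets of helix positions -1, +2, +3 and +6 within a 7-residue string
# that lists positions -1, +1, +2, +3, +4, +5, +6 (an assumed convention).
DETERMINANT_OFFSETS = {-1: 0, 2: 2, 3: 3, 6: 6}


def recognition_residues(helix):
    """Return the residues at positions -1, +2, +3 and +6 of one helix."""
    if len(helix) != 7:
        raise ValueError("expected a 7-residue helix (positions -1 to +6)")
    return "".join(helix[DETERMINANT_OFFSETS[p]] for p in (-1, 2, 3, 6))


def site_length(n_fingers):
    """Length of the DNA site spanned by an array in the canonical model.

    Each finger recognizes a 4-bp subsite and neighboring subsites share
    1 bp, so the total is 4 + 3 * (n_fingers - 1) = 3 * n_fingers + 1.
    """
    if n_fingers < 1:
        raise ValueError("need at least one finger")
    return 3 * n_fingers + 1
```

For instance, `recognition_residues("RSDTLAR")` picks out the four determinant residues of the GCG anchor finger helix quoted in the Methods, under the indexing assumption stated above.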

DNA-binding specificities have been determined for 
only a small fraction of ZFPs in metazoan genomes 
(13,17,26,47-50). Unlike other TF families where the 
majority of the resident factors in diverse species share a 
high degree of homology (26,51-54), evolutionary analysis 
of ZFPs indicates that a substantial fraction of resident 
members do not have highly conserved homologs across 
metazoans. Instead, the number and composition of 
fingers within these ZFPs is dynamic between species 
(27,55,56) and can even vary within a species [e.g. the 
variation in human PRDM9 isoforms (57,58)]. The speci- 
ficity determinants within these ZFPs are under strong 
positive selection, implying the rapid diversification of 
their recognition potential (27). Consequently, naturally 
occurring ZFPs can specify a wide variety of different 
DNA sequences based on both the number and compos- 
ition of fingers within the array. 

Although some principles that govern the recognition 
properties of zinc fingers have been established, the 
accurate prediction of their DNA-binding specificity 
remains challenging. Specificity determinants at individual 
recognition helix positions with defined base preferences 
have been extracted from the biochemical and struc- 
tural characterization of naturally occurring ZFPs 
(42,47,49,50,59-61) and the selection and characterization 
of artificial ZFPs that recognize novel target sequences 
(37,38,41,44,62-74). These data provide a foundation for 
the construction of predictive recognition models that 
estimate DNA-binding specificity based on the sequence 
of the recognition helix of each incorporated finger. Initial 
models focused on using the amino acid identity at key 
determinant positions (-1, +2, +3 and +6) to estimate the 
base preference at their primary DNA contact positions 
within the DNA subsite bound by each individual finger 
(75-77). Recently, more advanced predictive models have 
been constructed with improved performance that incorporate 
context-dependent recognition, which allows determinants 
to influence more binding site positions than 
prescribed by the standard recognition model (76-82). 
However, the construction of these models has been 
hampered by the limited amount of existing quantitative 
specificity data for ZFPs that links individual fingers with 
recognition of particular subsites. 

Figure 1. (A) Schematic representation of the canonical recognition 
pattern of two zinc fingers recognizing a hexamer sequence. Each 
zinc finger unit spans ~30 amino acids and folds into a ββα-motif 
around a tetrahedrally coordinated zinc ion (42,43). DNA-binding 
specificity is typically mediated by residues at positions -1, +2, +3 
and +6 of the recognition helix, where the numbering scheme refers 
to the position of each residue relative to the start of the α-helix. 
The boxed base pair (N4) represents the position of potential 
recognition overlap in the canonical recognition model. (B) Schematic 
representation of the two-stage process used to identify two-finger 
modules with the desired sequence preference. In Stage 1, the B2H 
system is used to select two-finger modules from an OPEN-based 
library, where the finger pools used correspond to the finger 2 (F2) 
and finger 3 (F3) subsites in each target site (44,45). These two-finger 
libraries are selected in the context of a constant finger 1 (F1) module 
that recognizes GCG in the neighboring subsite. The DNA-binding 
specificity of active clones recovered from the B2H selection was 
determined using the B1H system with a 6-bp randomized library 
adjacent to the constant GCG F1 binding site. The recovered binding 
sites are determined by Illumina sequencing and then a binding site 
motif is calculated from these sequences (46). 

A comprehensive recognition model for canonically 
binding ZFPs should be achievable using the growing 
archive of quantitative specificity data from recent bacterial 
one-hybrid (B1H) analysis of a large number of artificial 
(41,62,71) and naturally occurring ZFPs (49,50), 
where the position of each finger within the recognition 
sequence is defined or can be inferred. This data set spans 
678 two-finger modules, including the characterization of 
95 two-finger modules generated using the Oligomerized 
Pool ENgineering (OPEN) system (44,45) described 
herein. A sizeable fraction of these data explicitly 
examines the impact of recognition residues at the 
finger-finger interface on the preferred specificity at 
the junction of the finger binding sites, which remains 
the most challenging recognition feature to model. These 
data permit an improved estimation of context-dependent 
effects requiring the use of predictive models [such as 
support vector machine (83) or random forests (RFs) 
(84)] that implicitly capture these complex properties. 
Building on our previous efforts using RF models to 
estimate the specificity of homeodomains (85), we have 
constructed an RF predictive model for ZFPs using our 
B1H data that is superior to existing predictive models 
and that can effectively estimate the DNA-binding specificity 
of a number of naturally occurring ZFPs. 



MATERIALS AND METHODS 

OPEN finger selections 

OPEN selections were performed to generate a set of two- 
finger modules that recognize all 64 possible GNNGNG- 
type sequences in the context of an N-terminal 'GCG' 
binding anchor zinc finger (recognition helix: 
RSDTLAR). All target sites used in the selection of 
novel recognition fingers were of the form 
GNNGNGGCG. Zinc finger libraries for each target 
site were assembled from the corresponding Finger 2 
and Finger 3 OPEN pools as previously described but 
with a fixed Finger 1 module (44,45). OPEN selections 
were performed essentially as previously described 
(44,45) but using a beta-lactamase (bla) antibiotic-resistance 
gene instead of the HIS3 gene (70). For each 
of the 64 selections, we assayed the abilities of up to five 
clones to activate expression of a lacZ reporter gene in a 
bacterial two-hybrid (B2H) system as previously described 
(45) and determined the amino acid sequences of these 
clones. Fifty-eight of the 64 selections displayed active 
clones, from which we chose 95 clones that could 
activate expression of lacZ in the B2H system by 
~2.5-fold or more for further evaluation via B1H 
binding site selections (Supplementary Table S1). 

CV-B1H method 

To determine binding site specificities of OPEN-selected 
and other 2F-modules, the CV-B1H (Constrained 
Variation Bacterial one-Hybrid) assay was performed es- 
sentially as described previously (46). Two-finger modules 
were evaluated as fusions to the GCG anchor finger. 
Following transformation into the selection strain, 
1 X 10^ cells containing the zinc finger plasmid (1352- 
omega-UV2-ZFP) and the 6-bp randomized binding site 
library (in pH3U3) were plated on selective NM minimal 
medium plates (100 x 15 mm) containing 50 IPTG 
and 1 or 2mM 3-AT and grown at 37°C for 22-30 h. All 
cells on the plate were pooled, and the pH3U3 plasmids 
containing the compatible binding sites were isolated for 
identification of the functional DNA sequences. The 
binding site region was PCR amplified, barcoded and 
sequenced via Illumina sequencing, and then binding 
specificities were determined from these data using 
GRaMS modeling and the log-odds method (46,71,86). 

Construction of the RF ZFP regression model 

Based on a pilot study and previous work with 
homeodomain recognition modeling (85), we developed 
a recognition modeler based on an RF regression 
approach (84) using the 'randomForest' package for 
R [http://www.r-project.org/ (87)]. Two different 
ZFP RF regression models were trained based on the 
B1H specificity data: one-finger and two-finger models. 
The training data for the two-finger model consisted of 
678 protein sequences for two fingers of ZFPs and the 
position frequency matrices (PFMs) obtained from the 
B1H experiments described above. The one-finger model 
was trained on the same set but contained 1209 individual 
fingers (redundancy removed; Supplementary Table S2). 
Preliminary analysis showed that including additional 
protein positions beyond the canonical -1, +2, +3 and 
+6 recognition positions in each finger did not improve 
the accuracy of the model, so all further training used only 
those positions. Of the 678 two-finger examples, there are 
530 unique combinations of residues at positions -1, +2, 
+3 and +6; all of them are kept in the data set because the 
PFMs, while similar between repeats, are not identical and 
this maintains the inherent variability in the data. These 
models use the RF regression engine that was previously 
described (85). The modeler predicts the PFM for a zinc 
finger protein based on its sequence at the recognition 
positions, and the RF regression minimizes the mean-squared 
error (MSE) between the predicted and 
observed PFMs. MSE values for a single position can 
range from 0, if the two PFMs are identical, to 0.5 if 
they contain probabilities of 1.0 for different bases. A 
random position (probability of 0.25 for each base) 
would have a maximum MSE of 0.1875 compared with 
a position with probability of 1.0 for any base. Minimizing 
MSE therefore has the effect of generating PFMs that tend 
toward random at some positions instead of making high 
probability predictions that are frequently incorrect. 
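
The per-position MSE bounds quoted above can be checked with a short worked example. The sketch below assumes that the MSE of a position is the squared difference averaged over the four bases, which reproduces the stated values of 0.5 and 0.1875; averaging the per-position values over a whole PFM is a further assumption made for illustration, and the function names are hypothetical.

```python
def column_mse(p, q):
    """Mean-squared error between two base-probability columns (A, C, G, T)."""
    if len(p) != 4 or len(q) != 4:
        raise ValueError("each column needs four base probabilities")
    return sum((a - b) ** 2 for a, b in zip(p, q)) / 4


def pfm_mse(pfm_a, pfm_b):
    """Average the per-position MSE over two PFMs of equal length (assumed)."""
    if len(pfm_a) != len(pfm_b) or not pfm_a:
        raise ValueError("PFMs must be non-empty and of equal length")
    return sum(column_mse(p, q) for p, q in zip(pfm_a, pfm_b)) / len(pfm_a)


certain_a = [1.0, 0.0, 0.0, 0.0]      # probability 1.0 for A
certain_c = [0.0, 1.0, 0.0, 0.0]      # probability 1.0 for C
uniform = [0.25, 0.25, 0.25, 0.25]    # random position

worst_case = column_mse(certain_a, certain_c)   # two confident, conflicting calls
hedged_case = column_mse(uniform, certain_a)    # random column vs a confident one
```

A confident but wrong column costs 0.5, whereas a uniform column never costs more than 0.1875, which is why an MSE-minimizing model drifts toward uniform columns when the evidence is weak.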
We used the default value of 500 trees while training the 
RF model. In this model, a single tree picks predictive 
variables, specific amino acids at specific positions, 
randomly and then applies regression to estimate their 
contribution to each PFM parameter. The set of individual 
trees is then weighted by regression to minimize the 
overall MSE between the observed and predicted PFMs. 
Accuracies were determined by 10-fold cross-validation, 
where the total data set was divided into 10 subsets and 
training was based on nine of them and the accuracy 
measured on the remaining subset. Each of the subsets 
was left out in turn, and the testing accuracy is reported 
as the means and medians on the test sets. 
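
The 10-fold cross-validation procedure described above can be sketched generically as follows. This is not the authors' R code: the `fit` and `error` callables stand in for model training and MSE scoring, and all names are hypothetical. Only the Python standard library is used.

```python
import random
import statistics


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle item indices and deal them into k nearly equal subsets."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(examples, fit, error, k=10, seed=0):
    """Hold out each subset in turn; return (mean, median) of test errors.

    fit(train_examples) -> model
    error(model, example) -> non-negative float (for example, an MSE)
    """
    fold_errors = []
    for held_out in k_fold_indices(len(examples), k, seed):
        held = set(held_out)
        train = [ex for i, ex in enumerate(examples) if i not in held]
        test = [examples[i] for i in held_out]
        model = fit(train)  # train on the other k - 1 subsets
        fold_errors.append(statistics.mean(error(model, ex) for ex in test))
    return statistics.mean(fold_errors), statistics.median(fold_errors)
```

With 678 two-finger modules and k = 10, each held-out subset contains 67 or 68 modules.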

We chose to minimize MSE because we are specifically 
trying to find optimal PFMs that fit the entire distribution 
of binding site affinities. However, other objectives could 
be used instead. There have been a large number of dif- 
ferent methods proposed to compare motifs with each 
other and determine a quantitative measure of similarity 
(88-94). The MSE that we use is closely related to 
maximizing the Pearson correlation and is often a highly 
ranked method, particularly when trying to assign a motif 
to a specific class of transcription factors. In other 
approaches more emphasis is put on matching high infor- 
mation content positions in the binding sites and low 
information content positions are scored similar to 
mismatches. For example, the recently published zinc 
finger predictor from the Princeton group (82) specifically 
maximizes the number of correctly predicted positions 
with high information content, which has advantages for 
some purposes (see later in the text). 

Construction of ZFP recognition motif predictions 

We established a Web site that will predict the binding 
motif for an input ZFP containing any number of 
fingers (http://stormo.wustl.edu/ZFModels/). ZFP 
sequences can be submitted in two forms as follows: a 
concatenation of the four critical recognition residues of 
each finger (—1, +2, +3 and +6) or the entire protein 
sequence. In the latter case, the Web site will determine 
the locations of the recognition residues in each finger 
based on a HMMER analysis (95) of zinc finger motifs 
present within the sequence. Three different ZFP motif 
generation methods are available based on the trained 
RF regression models: one-finger model, multi-finger 
model and the average of these models. In the one-finger 
model, the predictions are based on training of single 
fingers, and the complete motif is predicted by 
concatenating the individual predictions. In the multi- 
finger model, the predictions are based on the two-finger 
training data, and the complete motif is stitched together 
from the overlapping two-finger predictions, where the 
positions of overlap between the motifs are averaged 
(Supplementary Figure S1). The third method averages 
together the prediction from the one-finger and two- 
finger models to generate the final prediction. Generally, 
the different predictions are in close agreement but 
sometimes there is a divergence and the most accurate 
may depend on the specific zinc finger protein; therefore, 
we advocate testing with each model to examine the 
inherent variation. 
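
The stitching step of the multi-finger model can be sketched as follows. This is a minimal illustration, not the ZFModels implementation: it assumes each two-finger prediction is a list of six columns of base probabilities, that the predictions are already ordered along the DNA, and that consecutive predictions share the three columns of their common finger. The function names are hypothetical.

```python
def average_columns(col_a, col_b):
    """Average two base-probability columns element by element."""
    return [(a + b) / 2 for a, b in zip(col_a, col_b)]


def stitch_two_finger_pfms(pfms, overlap=3):
    """Merge consecutive two-finger PFMs, averaging the shared columns.

    Each PFM is a list of columns; the last `overlap` columns of one PFM are
    assumed to describe the same positions as the first `overlap` columns of
    the next PFM.
    """
    if not pfms:
        raise ValueError("need at least one two-finger PFM")
    stitched = [list(col) for col in pfms[0]]
    for pfm in pfms[1:]:
        start = len(stitched) - overlap
        for i in range(overlap):
            stitched[start + i] = average_columns(stitched[start + i], pfm[i])
        stitched.extend(list(col) for col in pfm[overlap:])
    return stitched
```

Under these assumptions, a three-finger array gives two overlapping two-finger predictions and a 9-column stitched motif in which the three middle columns are averages.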

Evaluation of Bcl6 predictive motif for predicting 
ChIP-seq peaks 

The predicted DNA-binding specificity of Bcl6 was 
estimated using the multi-finger model through the 
ZFModels interface. The top 100 ChIP-seq peaks for 
Bcl6 (96) were extracted using Galaxy (97), and a motif 
for Bcl6 was extracted from these peaks using MEME 
(zoops mode) (98). MSE was calculated from this PFM 
against different motifs as described above. FIMO (99) 
was used to determine the number of the top 100 ChIP 
peaks containing favorable Bcl6 binding sites {P < 10""^) 
based on each motif. 

RESULTS 

Selection and characterization of two-finger modules 
recognizing GNNGNG target sites 

We used OPEN selections (44,45) to identify two-finger 
modules recognizing 64 different 6-bp target sites of the 
form GNNGNG (Figure 1B). This set of target sites was 
chosen to include a focused set of sequences that were 
available in the OPEN system to explore the quality of 
the B2H-generated fingers. In addition, for the defined 
target positions (constant guanines), there are strong 
expectations about the complementary recognition deter- 
minants that would be selected. Deviations from the 
expected residues in the recovered sequences would be 
indicative of context-dependent effects. These two-finger 
modules were selected via the B2H system in the context 
of a three-finger array harboring a fixed N-terminal 
anchor finger that recognizes a GCG subsite. Fifty-eight 
of these selections yielded zinc finger arrays that bound 
their target site as evidenced by their ability to activate 
transcription in a B2H lacZ reporter assay 
(Supplementary Table S1). 

We determined the DNA-binding specificity of a repre- 
sentative set of the B2H-selected two-finger modules using 
the B1H system (49,71). Each two-finger module was 
characterized using a reporter system containing a 6-bp 
randomized binding site library adjoining the finger 1 
recognition element, GCG (46,71) (Figure 1B). After 
selection, surviving colonies carrying the functional 
DNA sequences for each two-finger module recovered 
from this library were pooled and characterized by 
Illumina sequencing from which a preferred recognition 
motif was determined (46). This analysis yielded 
motifs for 95 OPEN-selected two-finger modules 
(Supplementary Figure S2). For 64 of these two-finger 
modules, the preferred recognition sequence matched the 
expected target site. The remaining modules are comple- 
mentary to their target sequence, but actually prefer a 
related binding site. These modules expand the population 
of characterized two-finger modules for the construction 
of artificial zinc finger arrays, and the coupled specificity 
data provide additional information on the recognition 
potential of specific determinant combinations for the 
construction of improved predictive models. 
Assessing context dependence in our selected 
two-finger modules 

As a basis set for constructing predictive recognition 
models for ZFPs, we have used quantitative B1H specificity 
data on a large group of naturally occurring (49,50) 
and artificial (41,62,71) zinc finger arrays. To facilitate the 
evaluation of DNA-recognition by these zinc fingers, we 
have parsed this data set into 1209 different one-finger 
modules or 678 different two-finger modules. For 
example, a characterized three-finger array is broken 
down into three one-finger modules or two overlapping 
two-finger modules with their associated subsite motifs 
(Supplementary Figure S1). Figure 2 shows the base preferences 
at base pair positions 1, 2 and 3 within the core 
subsite (contacted by specificity determinants at positions 
+6, +3 and -1, respectively; see Figure 1) for this data set 
of one-finger modules. In general, the observed amino acid 
to base correlations at each position are consistent with 
previous studies of recognition preferences for zinc finger 
proteins (42,43,50,76-78). The strongest correlations are 
observed at the central base; amino acid changes at 
position +3 in the recognition helix primarily influenced 
recognition at the middle base position of the altered 
finger subsite in our two-finger modules when examined 
over the data set (Supplementary Figure S3). The inde- 
pendence of recognition at this position was previously 
harnessed to expand the recognition diversity of our 
two-finger modules in a directed manner in many 
instances (71). 

Weaker correlations at other positions highlight the role 
of context on specificity. The influence of context dependence 
on the DNA-binding specificity of individual fingers 
is apparent from a qualitative analysis of finger sets within 
our data set, particularly at the finger-finger interface for 
a subset of two-finger modules where residues on both 
sides of the interface were randomized to more effectively 
capture these effects (Figure 1A) (62,71). For many individual 
two-finger modules, the base at position 4 is highly 
specified. However, when the preferred specificity at this 
position is binned across the data set based on the type of 
residue at position +6 of the N-terminal finger 
(Figure 3A), some amino acids are associated with each 
of the four bases in different C-terminal finger contexts. 
Glutamate at position +6 provides a notable example, 
where two-finger modules containing this residue display 
distinct preferences for each of the four bases at position 4 
(Figure 3B). The potential influence of residues within the 
C-terminal finger, in particular the residue at position +2, 
on recognition at base position 4 is well documented 
(29,31,38,100). Consistent with the potential influence of 
position +2 on recognition, changes in the residue at 
position +2 in the recognition helix in many instances 
appear to influence neighboring base preference, particu- 
larly at position 4 (Supplementary Figure S4). These data 
highlight the need for a predictive model that can capture 
the influence of each determinant position on multiple 
base positions within the zinc finger recognition sequence. 
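
The binning analysis described above (grouping the preferred base at position 4 by the residue at position +6 of the N-terminal finger) can be sketched as follows. The data values are invented toy numbers for illustration only and are not taken from the B1H data set; the function names are hypothetical.

```python
from collections import Counter, defaultdict

BASES = "ACGT"


def preferred_base(column):
    """Return the base with the highest probability in one PFM column."""
    return BASES[max(range(4), key=lambda i: column[i])]


def bin_position4_by_residue(modules):
    """Count preferred position-4 bases for each +6 residue.

    modules: iterable of (residue at +6 of the N-terminal finger,
                          position-4 column as [pA, pC, pG, pT]).
    """
    bins = defaultdict(Counter)
    for residue, column in modules:
        bins[residue][preferred_base(column)] += 1
    return bins


# Hypothetical toy modules: the same +6 residue paired with different
# C-terminal finger contexts, giving different position-4 preferences.
toy_modules = [
    ("E", [0.70, 0.10, 0.10, 0.10]),
    ("E", [0.05, 0.80, 0.10, 0.05]),
    ("E", [0.10, 0.10, 0.70, 0.10]),
    ("E", [0.10, 0.05, 0.05, 0.80]),
    ("R", [0.02, 0.03, 0.90, 0.05]),
]
toy_bins = bin_position4_by_residue(toy_modules)
```

A residue whose counter spans several bases, as glutamate does in this toy input, is the signature of context dependence that a one-finger lookup cannot represent.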

RF recognition models for ZFPs 

Zinc fingers have been the focus of several studies on 
qualitative recognition codes [reviewed in (42,43)]. More 
recently, several groups have developed models that 
predict quantitative motifs for zinc finger proteins based 
on the residues present at canonical recognition positions 
within each finger (76-79). Although superior to purely 
qualitative recognition codes, their accuracies leave 
considerable room for improvement. These models were 
limited because they were trained primarily on qualitative 
data: collections of proteins and their binding sites with 
high binding affinity, but where the preference of each 
ZFP for its target site relative to other sequences was 
unknown. Our B1H-characterized zinc finger data 
provide a much larger training set with quantitative information 
about the preferences of different proteins for 
different DNA binding sites, which allows us to train 
new recognition models to obtain higher accuracy predictions. 
In pilot studies, we tested the feasibility of creating 
recognition models using several different machine 
learning algorithms, including neural networks (78), 
support vector machines (83), k-nearest neighbors (101), 
partial linear regression (102) and RF (84). We found that 
RF-based models performed as well as or better than those 
of other methods and that the RF implementation was computationally 
less demanding, so we used an RF regression 
algorithm to create a predictive model for ZFPs. The 
results of these preliminary studies were similar to those 
we previously reported for predicting the specificity of 
homeodomain proteins (85). 

We trained RF predictive models on either one-finger or 
two-finger module specificity data, where the latter model is 
designed to capture context-dependent effects between 
neighboring fingers. Training the two-finger model takes 
as input the amino acids at the eight canonical recognition 
positions (-1, +2, +3 and +6 of each finger) and builds 
regression trees to predict recognition preference over the 
entire 6-bp binding site. (The one-finger model was 
similarly trained on individual fingers and each 3-bp 
binding site.) Importantly, these models are not restricted 
to the canonical interactions between particular finger 
recognition positions and bases within the binding site, 
unlike many previous recognition models (76,77). Because 
we have a much larger training set than was available for 
previous models, a wider range of potential interactions 
between these recognition positions and the binding site 
are allowed within the model to capture context-dependent 
effects observed within the data. Consequently, each recog- 
nition position within the two-finger module contributes to 
the overall predicted PFM, although the strongest contri- 
butions within the model will be between the most highly 
correlated amino acids and base pairs. 
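
One way to present the eight recognition residues to a regression learner is a one-hot encoding, sketched below. This is an assumption made for illustration: the text does not state how the residues were encoded for the R 'randomForest' model, which can also accept categorical inputs directly. The function name is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot_features(residues):
    """Encode eight recognition residues as a 160-element 0/1 vector.

    residues: positions -1, +2, +3 and +6 of the N-terminal finger followed
    by the same four positions of the C-terminal finger (assumed order).
    """
    if len(residues) != 8:
        raise ValueError("expected eight recognition residues")
    features = []
    for residue in residues:
        if residue not in AMINO_ACIDS:
            raise ValueError("unknown residue: " + residue)
        features.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return features
```

Because every feature is tied to a residue and a helix position but not to a particular base, a tree built on these inputs is free to link any determinant to any of the six binding site positions, which is the flexibility the paragraph above describes.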

The objective during model training is to minimize the 
MSE between the observed and predicted PFM values for 
each two-finger module. Table 1 shows the average value 
(both the mean and median with standard deviations) 
obtained in a 10-fold cross-validation of our two-finger 
model. This was compared with predictions by each of 
four other published models that were readily available 
for testing (76-79). The MSE is greatly reduced with the 
new ZFModels predictions to less than half for means and 
less than one-third for medians when compared with other 
prior models. The prediction error is fairly evenly 
distributed across the positions of the binding sites 
(Table 2). Figure 4 displays several examples that are near 
the median value of MSE to show the degree of similarity 
between observed and predicted PFMs. Many of the 
highest accuracy examples contain guanine at positions 1 
and 6 because the training set was biased with fingers 
recognizing guanine at these positions. Figure 4 highlights 
examples deviating from this pattern, demonstrating that 
our ZFModels can generate accurate predictions for a 
wide variety of different types of motifs. As expected, 
the two-finger predictive model can capture the context 
dependence at the finger-finger junction observed in our 
data set, such as the motifs in Figure 3B, whereas the 
one-finger predictive model fails to capture this subtlety 
(Supplementary Figure S5). 

Evaluating the utility of the RF-based zinc finger 
recognition model 

Several published studies have determined specificity of 
ZFPs using SELEX (26,103-105). None of these 
examples were included in the training data and so they 
constitute an independent test set. Supplementary Figure 
S6 contains the logos from the published PFMs for a 
subset of these ZFPs and the logos predicted by 
ZFModels. In every case, the predictions match preferred 
binding sites from the experiments when we take into 
account the variable spacing between neighboring fingers 
due to noncanonical linkers in some instances. However, 
the quantitative models are less consistent than the 
average fits to zinc fingers within our data set via cross- 
validation analysis (Supplementary Table S3). This may 
be due to the SELEX data being evaluated after multiple 
rounds of selection where the resulting PFM is heavily 
weighted toward a subset of the highest affinity sites, 
leading to an over-specified motif. We also compared 
the ZFModels predictions on some of the same data sets 
with the predictions made by a recently published method 
(zfprinceton.edu) based on support vector machine 
training (83). ZFModels makes more accurate predictions 
as measured by MSE (Supplementary Table S4) on these 
independent test sets than the Princeton model, although 
the Princeton model often contains more matching 
positions with high information content (see Discussion). 

Table 1. MSE for several prediction programs 

Program   ZFModels (a)    Benos (b)   Kaplan (c)   Zifnet (d)   ZIFIBI (e)
Mean      0.017 ± 0.005   0.044       0.047        0.040        0.072
Median    0.009 ± 0.002   0.033       0.035        0.032        0.063

(a) This work. Values are mean and standard deviation from 10-fold cross-validation. 
(b) Ref (76). 
(c) Ref (77). 
(d) Ref (78). 
(e) Ref (79). 

Ideally, our recognition model would also allow predic- 
tion of ZFPs with uncharacterized DNA-binding specifi- 
city throughout the genome. We chose to evaluate its 
predictive utility for Bcl6, as this ZFP has been 
characterized by B1H (50), PBM (47) and SELEX-seq 
(26), which allows a comparison of our predictive motif 
against DNA-binding specificities determined via multiple 
methods, and against ChIP-seq data for this factor (96). 
The Bcl6 recognition motifs produced by B1H, PBM and 
SELEX-seq are all similar, although the SELEX-seq motif 
appears over-specified (Figure 5). We also generated a 
predicted recognition motif for Bcl6 using the Princeton 
SVM model for comparison with our model. The 
Princeton motif has greater information content than 
our ZFModels motif, but at many positions, the 
Princeton motif predicts a particular base with absolute 
certainty, which much like the SELEX-seq motif suggests 
that it is over-specified. When judged against an independent 
source, a MEME (98) motif from the top 100 Bcl6 
ChIP-seq peaks (96), the B1H and PBM motifs appear 
most similar. The ZFModels multi-finger predictive 
model also shows good similarity to the determined 
motifs (MSE values 0.04 from the MEME-ChIP motif, 
0.05 from either the PBM- or B1H-based motifs, 0.05 
from the Princeton motif and 0.08 from the SELEX-seq 
motif), but it is a bit worse than the average value of <0.01 
in our cross-validation studies. FIMO analysis (99) of 
these ChIP peaks using each motif confirms this assessment: 
the MEME-derived motif from the Bcl6 ChIP data 
discovers a good Bcl6 binding site (P < 10~^) in 74 of 100 
peaks, the B1H motif in 56 of 100 peaks, the PBM motif in 
52 of 100, the SELEX-seq motif in 43 of 100, the 
ZFModels predicted motif in 25 of 100 and the 
Princeton motif in 9 of 100, where only four would be 
expected by chance. Thus, our predictive motif has value 
for the discrimination of binding sites within the genome, 
and in this example is superior to the Princeton motif, but 
it can still benefit from the incorporation of additional 
experimental data to improve its quahty. Figure 5 
displays logos in two formats, the original information- 
based method (106) and a PFM-based method where the 
height of each base is proportional to its frequency in the 
model (107). The frequency representation demonstrates 
that even in cases where our model does not make a con- 
fident (high probability and high information content) 



Table 2. MSE for each position, for one-finger and two-finger models (mean/median) 

Nucleotide position   1            2            3            4            5            6
1 finger              0.016/0.004  0.015/0.005  0.008/0.001  -            -            -
2 fingers             0.006/0.001  0.007/0.003  0.006/0.001  0.012/0.004  0.010/0.004  0.004/0.000

Note: The reported median values represent the bin the median value falls in, where the bins are 0.001 wide and labeled with the lower value. So if 
the median value is reported as 0.000 that means the median is in the bin between 0.000 and 0.001. These values come from training and testing on 
the complete data rather than from cross-validation, resulting in lower values than in Table 1. 

[Figure 4 panels: observed and predicted motif logos for the two-finger modules QGTE-CRNR, QTTN-QSTN and QSHQ-QSNQ (recognition positions -1, 2, 3 and 6 of fingers F1 and F2); MSE values shown: 0.01129, 0.00761 and 0.00971.] 

Figure 4. Examples of observed motifs for two-finger modules that are within our data set, and predicted motifs for these fingers using our final 
predictive model. Above each observed motif are the amino acids at the four canonical recognition positions (-1, +2, +3 and +6) for the N-terminal 
and C-terminal fingers. The MSE value between the observed and predicted PFMs is displayed above the predicted motif. 
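
As a minimal sketch of the comparison reported above each predicted motif, the mean squared error between two PFMs can be computed as below. The averaging convention is an assumption (squared differences averaged over the four bases at each position, then over positions), and the function names and toy matrices are illustrative rather than taken from the data set.

```python
# Illustrative sketch: MSE between an observed and a predicted PFM.
# Assumption: a PFM is a list of positions, each a dict mapping
# base -> frequency (summing to 1). The averaging convention is assumed,
# not taken from the paper's methods.

BASES = "ACGT"

def position_mse(obs_col, pred_col):
    """Mean squared difference over the four bases at one position."""
    return sum((obs_col[b] - pred_col[b]) ** 2 for b in BASES) / len(BASES)

def pfm_mse(observed, predicted):
    """Average the per-position MSE over all aligned positions."""
    if len(observed) != len(predicted):
        raise ValueError("PFMs must have the same number of positions")
    per_position = [position_mse(o, p) for o, p in zip(observed, predicted)]
    return sum(per_position) / len(per_position)

# Toy two-position example with made-up frequencies.
observed = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
predicted = [
    {"A": 0.5, "C": 0.2, "G": 0.2, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
print(round(pfm_mse(observed, predicted), 4))
```

Under this convention, identical motifs score 0 and the score grows with the squared frequency differences, so a per-position breakdown (as in Table 2) is simply the list of `position_mse` values.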



prediction, it generally gets the preferred base correct. 
Combining all of the experimental models with the 
MEME model from the ChIP-seq data, one finds a 
consensus sequence of TTCCTnGAAAG (positions 5-15 
in the alignment). Our model agrees at every position 
except 13, where it prefers G slightly to A, but many of 
those predictions are low confidence. In contrast, the 
Princeton model has more high information content pos- 
itions that match the consensus, but it also contains 
several positions where the preferred base is assigned a 
very low probability. Our model has an overall better fit 
to the other models, as evaluated by MSE and similarities 
to the rank distributions of all possible binding sites, but 
there are some purposes for which maximizing the 
number of high confidence, correct predictions is useful 
(see 'Discussion' section). 
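
The difference between the two logo formats can be illustrated with a short sketch. It assumes the standard information-content definition for DNA logos with a uniform background (2 bits minus the Shannon entropy of the column) and ignores small-sample corrections; the example columns are made up. A column can carry little information and still rank the correct base first, which is what the frequency representation makes visible.

```python
import math

BASES = "ACGT"

def information_content(col):
    """Bits of information at one position, assuming a uniform background."""
    entropy = -sum(p * math.log2(p) for p in col.values() if p > 0)
    return 2.0 - entropy

def preferred_base(col):
    """Most frequent base; ties resolve to the first base in ACGT order."""
    return max(BASES, key=lambda b: col[b])

def consensus(pfm):
    """Concatenate the preferred base at every position."""
    return "".join(preferred_base(col) for col in pfm)

# Made-up columns: both prefer A, but with very different confidence.
confident = {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01}
hedged = {"A": 0.40, "C": 0.20, "G": 0.20, "T": 0.20}

for name, col in (("confident", confident), ("hedged", hedged)):
    print(name, preferred_base(col), round(information_content(col), 2))
```

Both columns contribute the same letter to a consensus string, but only the first would be tall in an information-content logo.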

DISCUSSION 

The development of platforms for rapidly characterizing 
the specificity of transcription factors has dramatically 
increased the amount of data that is available for all of 
the major TF families (108), but there are still barriers to 
generating data for all naturally occurring ZFPs. The 
average number of fingers in a human ZFP is 8.5 (27), 
and these polydactyl (i.e. many fingered) ZFPs may have 
complex binding modes due to the presence of independent 
DNA-recognition modules. For example, genome-wide 
ChIP analysis of NRSF (109,110), a 9-finger ZFP, re- 
covered two different types of binding sites: a prominent 
motif that contains a juxtaposition of two subsites and a set 
of additional motifs with variable spacing between these 
subsites. Taipale and colleagues noted the difficulty in 
characterizing ZFPs by either SELEX-seq or PBM (26): 
they successfully characterized only 8% of ZFPs and only 
3% with more than eight fingers (26). Similarly, our B1H 
motif set includes only seven naturally occurring ZFPs with 
>8 fingers with a success rate of ~38% of the attempted 
Drosophila ZFP genes (50). With the possibility that poly- 
dactyl ZFPs use different finger sets to bind multiple 
distinct motifs, describing their recognition properties is 
critical to understanding their regulatory mechanisms. 

Figure 5. Comparison of the MEME motif from the top 100 Bcl6 ChIP peaks (96) with the motif predicted for the five canonically linked fingers by 
ZFModels and the Princeton SVM method (82) and the recognition motifs determined directly for Bcl6 by B1H (50), SELEX-seq (26) and PBM (47). 
The left column displays the motifs as information content, whereas the right column displays the motifs as position frequency plots. The frequency 
of a strong motif match (P < 10^-4) for each motif in the top 100 ChIP peaks as determined by FIMO is indicated above each motif. 



The growing body of quantitative specificity data for 
naturally occurring and artificial ZFPs provides a founda- 
tion for the development of improved predictive models for 
this family to help facilitate a broader understanding of 
their function as regulators within the genome, where 
other direct analysis methods may be challenging to use. 

Our efforts to construct an improved predictive model 
have focused on two aspects of the problem as follows: 
expanding the population of quantitatively characterized 
finger modules and using new methods for training 
improved recognition models. We have used OPEN- 
based ZFP selection methods (44,45) to expand our 
existing set of B1H-characterized artificial and naturally 
occurring fingers to 1209 one-finger modules and 678 two- 
finger modules. The latter group captures context-depend- 
ent effects that can occur at the finger-finger interface, 
allowing the construction of recognition models that 
span more than a single finger, thereby providing add- 
itional information on the recognition potential of 
specific determinant combinations for the construction 
of improved predictive models. These finger archives and 
the underlying data also have value in the design of 
artificial ZFPs to recognize specific sequences. Thus, the 
assembly of these modules can be data driven by applying 
'rules' for recognition of particular sequences to estimate 
which assembled finger models are likely to provide the 
desired composite specificities. 

Our assessment of ZFModels shows that the motif 
predictions obtained are superior to previously published 
predictors. This is likely due to our larger and better 
(i.e. quantitative) training sets that allow us to consider 
more interactions, not just the canonical ones that have 
been primarily used in the past. We have also leveraged 
our two-finger module data to extend the model construc- 
tion beyond one-finger units to two-finger units, where the 
two-finger model constructs motifs by assembling inter- 
faces via a stitching assembly (62) to try to minimize 
edge effects of the two-finger module data on the resulting 
motif. This model is accessible to the community through 
our Web site (http://stormo.wustl.edu/ZFModels/). Users 
can input a protein sequence and an HMM-based 
algorithm will extract the determinants in each finger for 
construction of a recognition motif. Users can use either 
the one-finger or multi-finger model, or a hybrid (average) 
of these two models for generating a motif for their factor. 
On an independent test set, the hybrid model performed 
slightly better (Supplementary Table S3), although the 
results from each method are similar. 
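
As a rough illustration of what a hybrid (average) motif could look like, the sketch below takes two predictions that are already aligned position by position and returns their element-wise mean. How ZFModels combines the two models internally is not described here, so this is an assumption for illustration and not the site's implementation; `hybrid_pfm` and the toy matrices are hypothetical.

```python
# Illustrative sketch: averaging two aligned PFM predictions.
# Assumption: the hybrid is the plain element-wise mean of base frequencies.

BASES = "ACGT"

def hybrid_pfm(one_finger, multi_finger):
    """Element-wise mean of two aligned PFMs, renormalized per position."""
    if len(one_finger) != len(multi_finger):
        raise ValueError("PFMs must have the same number of positions")
    hybrid = []
    for a, b in zip(one_finger, multi_finger):
        col = {base: (a[base] + b[base]) / 2 for base in BASES}
        total = sum(col.values())  # guards against rounding drift
        hybrid.append({base: value / total for base, value in col.items()})
    return hybrid

# Made-up single-position predictions from the two models.
one_finger = [{"A": 0.8, "C": 0.1, "G": 0.05, "T": 0.05}]
multi_finger = [{"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2}]
print(hybrid_pfm(one_finger, multi_finger))
```

Averaging tempers an over-confident column from either model while preserving the preferred base whenever the two models agree on it.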

There is still room for improvement in our predictive 
model, especially for some classes of C2H2 ZFs with 
noncanonical linkers that may lead to alternate finger 
sequences or binding modes, but in nearly every case 
tested the predictions are at least partially correct and 
allow for the alignment of the individual fingers with the 
segments of the binding motifs that they interact with. A 
recently reported large compendium of zinc finger proteins 
selected for binding to specific DNA sequences (74), and 
then with their specificities determined by BIH, may 
provide additional, more diverse information to improve 
the predictive models further, but this has not been tested 
yet. Currently, predictions from our models are not 
accurate enough on their own to make reliable regulatory 
networks, but may be useful in conjunction with accessi- 
bility data and DNase I footprinting data (12) to identify 
their regulatory sites. They can also aid in assigning 
ZF-TFs to particular motifs that are discovered through 
computational analysis of other genomic features, 
although for that particular problem, the alternative 
SVM approach of the Princeton group (82) will sometimes 
work better. Their approach trains their model to 
maximize the number of high information content 
positions that are correctly predicted. By then applying 
string matching methods, one can sometimes identify a 
ZF-TF that is likely to bind to a known motif 
[e.g. PRDM9 (58)] in cases where our model may yield a 
less definitive consensus because it may predict many low 
information content positions. In some cases, these 
approaches may also allow us to determine whether only 
a subset of ZFs are used to recognize DNA, or if different 
subsets are used to recognize different classes of binding 
sites, as when ZFPs use alternative modes of binding for 
interacting with different sequences. Given the rapid 
diversification of ZFPs during evolution and the technical 
challenges associated with experimental determination of 
their specificities, the continued refinement of predictive 
models will likely play an important role in understanding 
the roles of these proteins in transcriptional regulatory 
networks. 